Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add gnomAD v4 short variants #1178

Merged
merged 106 commits into from
Nov 1, 2023
Merged

Add gnomAD v4 short variants #1178

merged 106 commits into from
Nov 1, 2023

Conversation

mattsolo1
Copy link
Contributor

@mattsolo1 mattsolo1 commented Sep 15, 2023

Addition of gnomAD v4 short variants

Major update to add support for v4 short variants to browser.

Pipeline changes

This PR includes some updates to the gnomAD pipeline tool chain, tests, and modifications to the pipeline configuration. There is also the start of the v4 variant/coverage pipeline.

Still a draft & not yet confirmed for breaking changes to other pipelines; but feel free to look and provide initial feedback

Tool chain changes

Dependency management with poetry

Previously we had only a requirements.txt

Now dependencies installed with poetry, kept in pyproject.toml

This creates a lock file with exact versions.

Then generate requirements.txt with:

poetry export --with dev >requirements.txt

or by running update-requirements.sh

UPDATE: you don't have to use poetry if you don't touch dependencies; just use requirements.txt.

Linting

Pylint -> Ruff

  • Faster
  • More tools in 1 (e.g., flake8 + isort)
  • Used by many open source projects
  • Works better with editors

https://docs.astral.sh/ruff/

Black

I haven't done this yet, but I'd suggest decreasing line length 120 -> 88?

Type checking & input validation

Classes are defined with @attr.define

Example:

import attr

@attr.define
class Post:
    text: str

@attr.define
class User:
    user_id: int
    username: str
    email: str
    age: int
    posts: List[Post]


new_user = User(
    user_id=1, 
    username="Alice",
    email="[email protected]",
    age=25,
    status="active")  # pyright will raise a type error here

Advantages:

Nice editor experience!

More readable (compared to using dictionaries or having no types).

Use create() to instantiate classes instead of __init__

Use structure and unstructure utility functions to:

  • deserialize/serialize
  • validate inputs
  • Make nested types

Logging

Suggestion use Loguru instead of std lib logging (see their website for the advantages).

Input/output validation

Using ChatGPT, hail schemas, and attrs/cattrs libraries, it is possible to quickly create classes that can be used to validate pipeline I/O.

e.g.

@attr.define
class Step3Variant:
    locus: Locus
    alleles: list[str]
    grpmax: List[Grpmax]
    rsids: Union[Set[str], None]
    rf: Rf
    in_silico_predictors: InSilicoPredictors
    variant_id: str
    colocated_variants: ColocatedVariants
    gnomad: Gnomad # gnomad-specific data
    subsets: set[str]
    flags: set[str]
    transcript_consequences: Union[List[TranscriptConsequence], None]

Then validate:

def test_validate_step3_output():
    output_path = gnomad_v4_variant_pipeline.get_task(
        "annotate_gnomad_v4_exome_transcript_consequences"
    ).get_output_path()
    ht = hl.read_table(output_path)
    result = ht_to_json(ht)
    [structure_attrs_fromdict(variant, Step3Variant) for variant in result]

Checks for missing data too.

Comitting hail schemas to source control

write_schemas.py

Iterates through all hail inputs/outputs for a given pipeline & writes them to schemas dir.

e.g.

/schemas/gnomad_v4_variants/annotate_gnomad_v4_exome_transcript_consequences/output/gnomad_v4_variants_annotated_2.ht.schema

----------------------------------------
Global fields:
    'freq_meta': array<dict<str, str>> 
    'freq_index_dict': dict<str, int32> 
    'faf_meta': array<dict<str, str>> 
    'faf_index_dict': dict<str, int32> 
    'freq_sample_count': array<int32> 
    'filtering_model': struct {
        model_name: str, 
        score_name: str, 
        feature_medians: dict<tuple (
            str
        ), struct {
etc

This is analogous to our jest snapshots. If anything in the pipeline changes, the schema will be tracked in source control. And it's nice to have as a reference.

https://github.com/broadinstitute/gnomad-browser/pull/1178/files#diff-0811404f3bc0e96d8cd34bae975ef717b89943056daace9d6a5c280515412762R41-R54

Modifications to core Pipeline class

Improved IO config

Previously, one would run a pipeline with an --output-root, a required arg.

Data inputs were hardcoded, or relied on other pipelines that may not exist.

Makes it hard to develop pipelines in isolation.

The Pipeline class changed to make this an optional arg, accepts a config object that allows specifying both input & output dirs.

A new PipelineMock class makes it possible to stub other pipeline outputs.

This makes it possible to easily switch between local dev datasets & prod full datasets.

@mattsolo1 mattsolo1 force-pushed the gnomad-v4 branch 2 times, most recently from 604f868 to ab01ef3 Compare September 22, 2023 17:26
@mattsolo1 mattsolo1 changed the title Add gnomAD v4 variant and coverage pipeline Pipeline tool changes for gnomAD v4 Sep 28, 2023
Copy link
Contributor

@rileyhgrant rileyhgrant left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is looking quite cool, lots of great stuff that I'm excited to get acquainted with.

In general I'm pretty excited about the typing you've been adding, in addition to the making it far easier to create tests for pipelines locally with the dataset config.

In regards to the tooling:

  • Poetry, I've had a few difficulties with this so far, likely some significant amount of this is user error and/or some quirk of my machine, but Poetry does not consistently activate a virtual environment with poetry shell for me, even to the point of telling me its active when running which python shows that is most certainly is not active. So that's just to say I'm glad that at least if I'm not messing with dependencies it still can output a generic requirements.txt file. I'm curious to hear Phil's thoughts on Poetry.
  • Ruff, I'm sold
  • Black, I'm sold on shortening the line length. It appears as though 88 is their default value, so I'm on board with using their choice, as they're in the business of having opinions.
  • Type checking with pyright + attrs, this syntax does seem nice to quickly define types.
  • Loguru, I'm sold.
  • I/O validation with types, yet again, I'm sold, seems like a one time big cost then a bit of upkeep for huge benefits
  • Commiting Hail schemas to Source Control, this seems like it could have some nice benefits, what is the intended workflow with this? Is it to locally run the python script to generate the schemas, then check for a diff in between the old one and the new one, before running a pipeline? In that case do we remember to manually run this on our machine, and check these new ones into source control, then having some CI check for a difference a la automated Jest tests? Or is this more just to always be able to reference the old schema, I know I've spent enough time making a tiny script to run describe on a dataset because I don't have the schema handy
  • Dataset Configs, I see how this is useful for being able to generate tests for entire pipelines. Would a small test for a given pipeline require replicating the entire pipeline in a test file (running all the same tasks in the same order) with a different dataset config that dictates a different input and output? Or is the benefit here being able to test the code for the actual pipeline at all, not what a particular pipeline does in its own logic?

In general, I haven't gotten my hands too dirty with these proposed changes in a firsthand way yet, but I'm excited about the suggestions and to learn from the knowledge you've gained from other projects.

@mattsolo1
Copy link
Contributor Author

Thank you for the feedback, Riley!

Poetry, I've had a few difficulties with this so far

Interesting. Poetry isn't a hard requirement for me, but I do find it makes it significantly easier to keep dependencies up to date and to know exactly what versions are being used. I'm happy to remove this if people would like, in favor for a simpler requirements.txt without pinned versions.

Committing Hail schemas to Source Control, this seems like it could have some nice benefits, what is the intended workflow with this?

do we remember to manually run this on our machine, and check these new ones into source control, then having some CI check for a difference a la automated Jest tests

Basically yes I would see us having a shared "test dataset" in GCS somewhere that is downloaded locally with a pipeline step, the pipeline is executed, and then a diff is checked in CI. Ideally this is not strictly manual; the idea is to mimic Jest-style snapshots. And yes having a browsable schema in source is a pretty significant bonus, IMHO.

Would a small test for a given pipeline require replicating the entire pipeline in a test file (running all the same tasks in the same order) with a different dataset config that dictates a different input and output? Or is the benefit here being able to test the code for the actual pipeline at all, not what a particular pipeline does in its own logic?

Not quite sure what is being asked here, but happy to chat in person to clarify. I think it's the later; having a systematic way to run a small dataset through the pipeline is useful for prototyping and making changes.

Copy link
Contributor

@phildarnowsky-broad phildarnowsky-broad left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🎉

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants